



The 
PSYCHOLOGICAL 


RECORD 


Vol. I NOVEMBER, 1937 No. 22 














HAZARDS AND FALLACIES OF 
STATISTICAL METHOD IN 
PSYCHOLOGICAL 
MEASUREMENT 


JOHN GRAY PEATMAN













The Principia Press, Inc. 
Bloomington, Ind. 






Price of this number, 50 cents 





EDITOR MANAGING EDITOR 


J. R. Kantor C. M. Louttit 
Indiana University Indiana University 
DEPARTMENTAL EDITORS 
Abnormal Experimental 
Edmund S. Conklin, B. F. Skinner, 
Indiana Minnesota 
Child Medical 
Helen Koch, Norman Cameron, M. D., 
Chicago Johns Hopkins 
Clinical Physiological 
B. M. Castner, C. F. Scofield, 
Yale Buffalo 
Comparative Psychometrics 
E. A. Culler, J. P. Guilford, 
Illinois Nebraska 
Educational Social 
J. G. Peatman, Norman C. Meier, 
CCNY. Iowa 


Editorial Assistant, J. W. Carter Jr., Indiana 
















HAZARDS AND FALLACIES OF STATISTICAL METHOD* 
IN PSYCHOLOGICAL MEASUREMENT 


JOHN GRAY PEATMAN¹
College of the City of New York 


Statistical method in psychological measurement is an invaluable 
aid, first, in the organization and summarization of numerical results, 
and, second, in the comparison of these results as summarized. The 
immediate purposes of such organization and comparison are usually 
twofold: the measurement of human abilities and the inquiry into 
and possible discovery of essential relations between measurable
manifestations of psychological phenomena. 

Most fundamentally, we of course aim in psychological measure- 
ment to obtain results which will validly serve as evidence for gen- 
eralizations about human nature. It should be obvious that the valid 
interpretation of statistical results is dependent upon a knowledge of 
what is measured as well as of the conditions of measurement, and 
that this dependence becomes greater as we attempt to establish 
propositions of greater generality. We wish to generalize but un- 
fortunately generalizing is harassed by hazards and fallacies, ever 
present for the untutored or the unwary. 


Vigilant attention to the role of error and fallacy in psychological 
measurement cannot be overemphasized. Current psychological liter- 
ature having recourse to statistical method too often bears witness 
to the importance of this statement, especially in the frequent misuse 
of probable error estimates and in the tendency to generalize about 
everything on the basis of statistical results having little or no rele- 
vance to anything. It seems to us appropriate, therefore, to bring 
together and summarize briefly the chief hazards and fallacies which 
we should try to avoid. Their avoidance will certainly be an easier 
task if we have them out in the open where we can see them clearly. 


1 In publishing this manuscript, the author wishes to acknowledge the
inspiration and influence of Drs. Morris R. Cohen and Ernest Nagel, as 
derived from their writings as well as from association with them in teaching 
and discussion. 


* Manuscript accepted by Dr. J. R. Kantor, September 24, 1937. 












A. THE HAZARDS OF GENERALIZING 


From our knowledge of groups of individuals, we often generalize 
about persons or groups which we have never examined nor ever 
hope to deal with directly. Now our concern here is that of apprais- 
ing the validity of generalizations made in relation to the statistical 
results of psychological measurement. In other words, we are concerned
with the problem of evaluating propositions or statements
about the nature of psychological phenomena in the light of such 
evidence as we think we may have for them. This is, strictly speak- 
ing, a matter of material truth and logical implication, rather than a 
psychological problem, although the psychological process of reason- 
ing is of course indispensable. It is the problem of determining to 
what extent the results of measurement are a fair sample of all the 
instances about which we wish to make a generalization. Most of 
the hazards and fallacies of statistical inference are, in one form or 
another, but an aspect of this general problem. We shall therefore 
treat it first and then proceed in the second section to summarize 
some of the common, specific hazards and fallacies of statistical inter- 
pretation. 


The Problem of a Fair Sample 


A real problem in psychological measurement is to determine the 
limits within which we may establish valid generalizations about 
individuals or groups of individuals on the basis of the measurements 
of sample populations and sample tasks or instruments. The problem 
lies in trying to determine the validity with which a given set of 
results is fairly descriptive of unmeasured instances. We shall treat 
this problem in some detail by considering sample populations. 


Fair Population Samples 


A fair population sample is a group of subjects which in all 
essential respects is representative of the total population of individ- 
uals about which a generalization is made. If we were completely 
ignorant of what constitutes a fair sample in psychological measure- 
ment, then increasing the number of subjects in a sample, or using 
different samples and comparing the results of one sample with those 
of another, would be the only methods available by which we might 
attempt to establish fair population samples. Actually, of course, we 
know some of the essential conditions about human beings which 
will make for unfair samples when the nature of total populations is 
not suitably delimited. It is fortunate that this is the case, since
merely adding instances to a sample, as is frequently done, does not
at all guarantee a population sample of increasing fairness. 
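
The point that sheer numbers do not cure an unfair sample can be illustrated with a small simulation. The sketch below is an illustration only: the population, the score values, and the selection rule (the hypothetical `draw_selected_sample` admits only subjects above a cut-off) are assumptions made for the example, not data from any study.

```python
import random
import statistics

def make_population(size, seed=0):
    """Hypothetical population of test scores (mean near 100, SD near 15)."""
    rng = random.Random(seed)
    return [rng.gauss(100, 15) for _ in range(size)]

def draw_random_sample(population, n, rng):
    """Fair sample: every member has the same chance of being chosen."""
    return rng.sample(population, n)

def draw_selected_sample(population, n, rng, cutoff=100):
    """Unfair sample: only members scoring above the cutoff can be chosen."""
    eligible = [x for x in population if x > cutoff]
    return rng.sample(eligible, n)

population = make_population(50_000)
true_mean = statistics.fmean(population)
rng = random.Random(1)

# Compare the error of the sample mean as the sample grows.
for n in (25, 400, 6400):
    fair = statistics.fmean(draw_random_sample(population, n, rng))
    unfair = statistics.fmean(draw_selected_sample(population, n, rng))
    print(f"n={n:5d}  fair error={fair - true_mean:+6.2f}  "
          f"unfair error={unfair - true_mean:+6.2f}")
```

Under these assumptions the error of the fair sample shrinks as instances are added, whereas the error of the selected sample settles at a constant value that no increase in numbers removes.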


Fair Samples by the Method of Homogeneous Limitation 


If we select samples of subjects that tend to be homogeneous in 
certain characteristics, we can reduce the constant errors which might 
otherwise affect our results by limiting our generalizations to popu- 
lations also homogeneous, or similar, in the chosen characteristics. 


We may illustrate the bare essentials of this method of establishing 
fair samples by reference to an example in physics. If we wish to 
get a fair sample of water to be used in determining the temperature 
at which water freezes, we take a sample homogeneous in respect to 
the factor of purity, i.e., a sample free of impurities. We do this in 
order to eliminate the errors that might arise from taking heterogene- 
ous samples, i.e., water containing varying and unknown substances. 
We obtain a pure sample by careful distillation. And since distilled 
water is not necessarily a fair sample of non-distilled water, any 
generalization about the freezing point will be limited to water which 
is homogeneous in respect to purity. Or, by definition, the concept 
water might be delimited to instances of distilled water. Now, if we 
did not know that the temperature at which water freezes will vary 
with changes in atmospheric pressure, there would be a constant 
error in measurements made only at sea level, in so far as such 
measures might be taken as a fair sample for any altitude. In order,
therefore, to eliminate this constant error in sampling, we might do 
either one of two things: (1) take our sample as homogeneous in 
respect to pressure conditions and limit our generalizations about the 
temperature at which water freezes to other instances taken under 
the same pressure conditions; or (2) experimentally determine 
whether there is a constant relation between freezing point and pres- 
sure changes. If we discover a constant relation, then our generali- 
zation about the freezing point can be expressed as a known function 
of pressure, at least within the limits of the range of pressures used.

The method of homogeneous limitation cannot of course be as 
successfully applied to psychological phenomena as to physical be- 
cause (1) the number of factors making for heterogeneity, i.e., vari- 
ability, in human organisms is much greater and (2) many or most 
of the factors may be indeterminate or unknown. However, such
matters are relative, the point being that it is possible by the method
of homogeneous limitation to establish samples of subjects which will 
tend to be more fair for a given psychological problem than if the
method were not used at all. Although constant errors in population
samples cannot be entirely eliminated, they can successfully be re- 
duced. 


Factors making for Heterogeneity 


In applying the method of homogeneous limitation to a given psy- 
chological problem, we need to select a sample of subjects which will 
be similar in respect to only those traits or psychological functions 
which are known to be, or are thought to be, relevant to the prob- 
lems of the investigation. In general, there are two classes of factors 
which are thought to make for heterogeneity in individuals. There 
are ancestral (or hereditary) factors. And there are the externally 
operating factors of individual development—cultural, social, and 
educational factors, as well as factors of the physical environment. 


Chronological Age and the Problem of Fair Population Samples 


The longer an individual has lived the greater the number and 
complexity of determining factors that will have operated in his de- 
velopment. Consequently, a group of individuals of varying ages is 
in general much more heterogeneous in its psychological functions 
than a group of individuals of about the same age. In trying to 
reduce the effect of constant errors in population samples, a group 
of subjects is therefore taken as relatively homogeneous in age for 
all psychological investigations in which variability in age is known,
or believed, to make a difference. General propositions inferred from
the study of a homogeneous age group are therefore limited to other
individuals similar at least in age. Homogeneous age grouping
does not of course eliminate all of the constant errors arising in popu- 
lation samples, but it is evident that an important source of error 
tends to be avoided. 


The Genetic Fallacy 


If we generalize about adult human beings from the results of 
studies made on infants or children, our propositions will be fallacious 
to the extent that infants or children do not tend to be fair samples 
of adults. Whether, in fact, they do tend to be fair samples depends 
on the nature of the proposition and the groups themselves. Natur- 
ally we would not expect them to be a fair sample in all things— 
in wisdom and motor skills, for example. They may be a fair sample 
in regard to some propositions about digestion and the circulation 
of the blood. They are often taken to be fair samples with respect
to some aspects of the learning process. As a matter of fact, white
rats and other infra-human organisms have been similarly taken as
fair samples in respect to the nature of some aspects of the learning 
process in adult human beings. 

In so far as such samples do not tend to be fair, i.e., in so far as
infants or animals are not similar to adult human beings in respects 
essential to the type of proposition under consideration, generaliza- 
tions about adults will be fallacious. This kind of invalid inference 
is usually characterized as the genetic fallacy. Its occurrence has 
often been revealed as the result of a direct psychological study of 
the adult human being himself. 


Anthropomorphism and Enelicomorphism 

The converse of the genetic fallacy may arise if we make generali- 
zations about infants or animals on the basis of observations derived 
from a sample of adult human beings. This type of invalid inference 
has occurred so frequently in psychology and the biological and 
social sciences that two forms of it have been given distinctive names. 
One, the fallacy of anthropomorphism, occurs in so far as adults 
are not fair samples of animals. The other, the fallacy of enelicomorphism²,
occurs in so far as adults are not fair samples of infants
and children. The recognition of this latter type of invalid inference,
and of some of the types of propositions in reference
to which it is apt to occur, works effectively in establishing
parent-child and teacher-child relations on a sounder psychological 
basis. A parent or a teacher who expects a child of five years of age 
to have the standards of honesty and deportment, typical of himself 
or of his culture, is sowing seeds of maladjustment, of discontent 
and repression, or of discontent and malicious behavior, in the child. 


Cultural, Social and Educational Factors and the Problem of Fair 
Population Samples 

Individuals of different cultures, social groups, and educational
opportunities are obviously not homogeneous in respect to many of 
the determining factors of their development. Differences in temper- 
ament, in assertiveness or submissiveness, may be a function of differ- 
ent cultures. Differences in the way individuals approach the prob- 
lems of life’s situations, in the crassness or subtlety of their efforts 
to succeed, may be a function of different social backgrounds. Differences
in individuals’ knowledge and motor skills may be a function
of different educational opportunities. And the older individuals
are, the more these many determining factors have had a chance to
operate in their development.


2 This term was introduced by the late Professor Warren of Princeton
University. See his Dictionary of Psychology, 1934, Houghton Mifflin, Cambridge, Mass.

In view of the thousands of determining conditions in development 
which are a function of the kind of world in which individuals 
grow, it may be apparent that we can not expect to establish sample 
groups of subjects which will be, on the average, homogeneous in
all these things. The real problem is to determine the limits within 
which we may establish valid propositions about groups of individuals 
from the results of given samples. Having sample groups of subjects 
and reliable measurements of their activities, we see that the real 
problem lies in determining the validity with which our inferences 
from their results are descriptive of other groups of individuals. 
Errors in population sampling will increase as we extend our generali- 
zations to groups of different opportunities, in so far as such differ- 
ences make for variability in the kinds of psychological functions 
studied. 

Consider, for example, the Terman Revision of the Binet-Simon 
intelligence test. It is often used as if it provides a measure of 
intelligence which is relatively independent of differences in culture, 
social background, and educational opportunities. Most psychologists 
recognize, of course, that its comparative value as a measure of 
intelligence is limited to subjects having certain conditions of homogeneity
in these things: subjects with similar opportunities to learn
the English language for naming, describing, and counting; to become
acquainted with a sample of objects taken as common to our culture
(keys, coins, knives, watches, pencils, pictures of many other objects);
to learn the distinctions of our calendar year; to calculate, to read, 
and to write; to learn to tie knots, etc., as well as similar opportuni- 
ties to learn habits of social rapport such as will enable those subjects 
(who can acquire such habits) to adapt themselves readily to the 
testing situation with a high degree of interest and confidence. In 
so far as individuals have not developed in relation to the determining 
conditions of such a culture, their intelligence level is erroneously 
measured by this test. They are not members of that group for 
which the intelligence test measures may truly have much validity. 


Ancestral Factors and the Problem of Fair Population Samples 


There seems to be little doubt that differences in heredity somehow 
make for heterogeneity or variability in many human characteristics, 
and that increased similarity in heredity makes for a degree of homogeneity
in many traits. The range and nature of hereditary determinants
which make for individual differences, for heterogeneity, in
somatic and psychological characteristics are not known. But that 
they are important is evidently well-established. Differences in the 
sheer ability of individuals to learn, to respond immediately to sig- 
nificant relations, are a partial function at least of their differences 
in ancestral conditions. Differences in somatic characteristics, such 
as skin and eye color, texture of hair, etc., seem to be mainly a 
function of such determining conditions. Similarly, the physiological 
differences between males and females are evidently a function of 
these kinds of factors. 

So far as we know, however, differences in skin color, color of 
hair and eyes, etc., are not reliable signs of corresponding differences 
in psychological characteristics. Thus, the selection of a group of 
subjects fairly homogeneous in skin color, or texture of hair, does 
not give a group of individuals that is homogeneous in its ability to 
learn. 

The method of homogeneous limitation is generally applied to the
problem of variability in ancestral conditions by using large and 
random samples of subjects with relatively similar ancestors. The 
criterion for similarity is usually taken as skin color or geographic 
background, i.e., the residence locality of ancestors. Whether such 
selection makes a group more homogeneous in respect to the operation 
of hereditary factors in psychological variability is, however, problem- 
atical. 


Special Factors and the Problem of Fair Population Samples 

A group of individuals is often selected as homogeneous in respect 
to characteristics which are especially relevant to a particular psy- 
chological investigation. Thus, in the standardization of an intelli- 
gence test, the attempt is made to obtain a group of subjects which 
will do its best on the tasks presented. Propositions based on their 
results are therefore limited to other individuals homogeneous in this
respect. Or, in studying visual and auditory perception, the attempt 
is made to select a group which has at least no obvious impairment 
of the receptor organs involved. 

Practically all psychological investigations call for a population 
sample selected as homogeneous in respect to certain factors, specially 
relevant to the particular type of inquiry. The important point here, 
perhaps, is that such selective factors should be noted and any general- 
izations about unsampled populations should be limited to persons 
who are similarly homogeneous. 













1. THE FALLACY OF SELECTION 


The preceding account may be brought together as follows: if 
we fail to limit or confine our generalizations to groups of individuals 
which tend to be positively analogous in all essential respects to our
measured population, we make invalid inferences which are usually 
called fallacies of selection. Or to state the matter in another way: 
if our generalizations about unmeasured populations are not based 
on a measured population which tends to be a fair sample of the 
former, we are erring in our reasoning, and to the extent that our 
sample is unfair, our generalizations are fallacious. The genetic 
fallacy, anthropomorphism and enelicomorphism are instances of this 
general fallacy of selection. 


One of the great changes in psychological research during the 
past fifty years has been its increased utilization of statistical method 
and appropriate reasoning processes for the avoidance of this fallacy 
of selection. Whether intentionally or whether inadvertently, many 
of the propositions of the old psychology were based upon a few 
instances selected to illustrate or “prove” them. The early history 
of phrenology as developed by Gall and Spurzheim is an excellent 
example of this. Apparently these men erred inadvertently in 
selecting their cases to demonstrate that the so-called mental faculties 
correlate with the conformation of the skull. Their failure to avoid 
the hazards of selection was more excusable in their time than would 
be the case now, since the nature and operation of this fallacy are 
more widely recognized today. Unfortunately, however, many gen- 
eralizations are still made in the psychological literature with little 
or no regard to the operation of this fallacy. Many of the proposi- 
tions of “general psychology”, for example, are possibly peculiar to 
selected cultures and not representative of homo sapiens in general. 
As Klineberg has pointed out, “The books on ‘general psychology’ 
in present use should more accurately add the subtitle ‘western 
Europe and America’, since their generalizations have rarely been 
checked in other parts of the world. Even when the problem is 
recognized, little is done about it.”³ The problem is, of course, one
of fair sampling, as well as of competent observation and measure- 
ment, and the procedure of attack should consist in obtaining samples 
of different cultures in order to check those generalizations which 
are suspect. The theory of human motives, their alleged
biological-versus-cultural nature, is a case in point. The culturally
determined form of many of our biological urges or drives is lost sight of
when our studies of human behavior are confined to cultures homogeneous
in relevant respects. Margaret Mead’s work, “Sex and Temperament”⁴,
and the works of other cultural anthropologists have
brought out this fact most strikingly.


3 Klineberg, O. Race Differences. 1935, Harper, New York, p. 256.


2. THE GROUP-TO-INDIVIDUAL FALLACY 


Another fallacy rather common to our generalizations about human 
nature may occur in asserting that a particular individual has the 
qualities or characteristics found to be typical of the group of 
which he is a member. In other words, what has been found to be 
true of the group, taken as a whole, or on the average, is asserted 
to be characteristic of individual members. When such assertions 
are invalid, we have instances of the group-to-individual fallacy. 
And we may note that, when it occurs, this fallacy often arises out 
of the fact of individual variability in a given psychological quality 
or function. Thus, on the average, Negro children in the Southern 
States obtain lower I. Q. scores than do white children, on the aver- 
age, and of the same locality. This does not necessarily mean, 
however, that a Negro child chosen randomly from his group will 
have a lower I. Q. score than a white child similarly chosen from his 
group. There is no necessity here inasmuch as the overlapping of the 
scores of the two groups is known to be great. Only in the case that two
groups had no overlapping in the scores of their respective members
could we validly infer that each individual of the one would have a 
higher (or lower) score than each individual of the other. Or, 
again, the fact that more than a majority of 16 year old boys and girls 
in this country have had the opportunity to attend school through 
the eighth grade does not warrant the inference that a 16 year old, 
chosen at random, will have had the educational opportunities of the 
eighth grade graduate. 
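
The effect of overlapping distributions can be made concrete with a short calculation. The sketch below assumes, purely for illustration, two groups whose scores are independent and normally distributed with equal spread; the means, the standard deviations, and the function name `prob_a_exceeds_b` are hypothetical and are not taken from any data cited here.

```python
import math

def normal_cdf(z):
    """Cumulative probability of the standard normal curve."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def prob_a_exceeds_b(mean_a, sd_a, mean_b, sd_b):
    """Chance that a random member of group A outscores a random member
    of group B, assuming independent, normally distributed scores."""
    z = (mean_a - mean_b) / math.sqrt(sd_a ** 2 + sd_b ** 2)
    return normal_cdf(z)

# Hypothetical figures: group A averages 5 points below group B.
p = prob_a_exceeds_b(mean_a=95, sd_a=15, mean_b=100, sd_b=15)
print(f"P(member of A outscores member of B) = {p:.2f}")
```

With these assumed figures a randomly chosen member of the lower-scoring group outscores a randomly chosen member of the other about four times in ten, so the difference between group averages settles very little about any one pair of individuals.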

The group-to-individual fallacy also frequently occurs as the result 
of interpretations given correlation coefficients. The product-mo- 
ment correlation coefficient (r) is a summary statement of the degree 
of co-variability which, on the average, is characteristic of the pairs 
of individual scores in two distributions of variable measures. Again 
we are dealing with averages and it is only as coefficients approach 
1.00 or -1.00 that we tend to have reliable information about the
co-variance of particular members. There may be a correlation of
.60 between the intelligence test scores and the scholastic records of a
large group of individuals, but such co-relationship does not at all
warrant the generalization that an individual chosen randomly from
the group will be mediocre, or superior, in both his test score and his
grade rating.


4 Mead, M. Sex and Temperament. 1935, Morrow, New York.

Using groups of children of the same age, a number of psycholo- 
gists employing different samples have found a correlation of about 
.10 between measures of height and intelligence test scores. Some 
have interpreted such a result as implying that there is a tendency 
for tall children of a given age group to be brighter than the shorter 
ones. Of course the assertion is not made that this relationship is the 
case for particular individuals, but the generalization is often im- 
plicitly made that it is so, on the average. Such a low degree of 
concomitant variation warrants no such inference. In fact, it is 
difficult to understand what meaning can be given to the term
“tendency” for such low correlative relations. They are so non-in- 
dicative of co-variance as to be negligible in value. Even for corre- 
lations of .90 or .99, the margin of error in the making of an indi- 
vidual predictive estimate is too great to allow for a valid, categorical 
proposition about a specific member of the group. 
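
The margin of error in an individual predictive estimate can be sketched numerically. The example below uses the usual textbook expression for the standard error of estimate in linear regression, the criterion's standard deviation multiplied by the square root of (1 - r²); this formula is supplied here as an assumption for illustration and is not stated in the text, and the function name is hypothetical.

```python
import math

def standard_error_of_estimate(r, sd_y=1.0):
    """Spread of actual scores around the score predicted by a linear
    regression, in units of the criterion's standard deviation."""
    return sd_y * math.sqrt(1.0 - r ** 2)

# Correlation values mentioned in the discussion above.
for r in (0.10, 0.60, 0.90, 0.99):
    see = standard_error_of_estimate(r)
    print(f"r = {r:.2f}  shared variance = {r * r:6.1%}  "
          f"error of estimate = {see:.3f} SD")
```

On this reckoning a correlation of .10 leaves the error of an individual estimate practically as large as if no correlation were known at all, and a correlation of .60 still leaves it at eight-tenths of the original spread.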

Not all generalizations about human nature, however, are subject 
to this group-to-individual fallacy. Assertions that a person will 
manifest fundamental psychological qualities, such as the capacity to 
learn and remember, to perceive and to entertain beliefs, etc., may 
have a truth value approaching certainty, although we may have 
no previous or specific information about the particular individual. 
The chances of an error in reasoning in such instances are apt to be 
slight because these generalizations are assertions about the qualities 
of that class of beings we call homo sapiens. To be identified as a 
human being is tantamount to possessing these qualities. It is when 
our generalizations take the form of assertions of the degree to which 
a particular individual may manifest a given psychological quality 
that we are particularly apt to make a fallacious inference. 


3. THE FALLACY OF PIGEON-HOLING 


This fallacy is analogous to compounding a felony. It is both 
the fallacy of selection and the group-to-individual fallacy, and it 
occurs daily in social intercourse. We meet a person and on the 
basis of his name, or his skin color, or the texture of his hair, or the 











STATISTICAL METHOD IN PSYCHOLOGICAL MEASUREMENT 3795 


shape of his eyes, we pigeon-hole him into a class af beings which we 
believe invariably possesses certain qualities or traits. Not only 
are we making a judgment about the person on the basis of our 
beliefs about the group with which we identify him, but also our 
beliefs about the group are apt to be based upon an unfair or per- 
verted sampling (if they are based on anything other than hearsay). 
Thus we may entertain ideas about the nature of Chinamen as ap- 
prehended from motion pictures; we meet a man, identify him as a 
Chinaman, and then immediately pigeon-hole him into that narrow 
mental compartment of our ideas about his race. 

The employment of our ideas in such a manner makes for social 
attitudes which tend to become crystallized and play a tremendous 
role in determining the nature of social relations. Such attitudes are 
of great importance in the efficacy of propaganda and education. 
Walter Lippmann in his work, “Public Opinion”, gave them special
treatment and described them as “stereotypes”. Once we identify, 
or have identified for us, a man as a pacifist, a politician, a com- 
munist, a capitalist, or a college professor, our thinking processes 
are apt to click just once. The man is thereupon pigeon-holed and 
the occasion for further perceptual scrutiny or reasoned thought is 
past. A great labor-saving type of behavior, but unfortunately it 
usually compounds the fallacies of inference, to say nothing of the 
social products. The person who employs statistical method as a 
tool in psychology should learn to employ it with sufficient sagacity 
that he can at least avoid these fallacies in the interpretation of his 
own research data. One of the greatest services statistical method in 
research has yielded to psychology and the social sciences has been 
to lay bare many of the stereotypes of our thought, as well as to 
explode that age-old tendency of asserting personality traits to be 
either-or and all-or-none affairs. For example, we are now able 
to demonstrate the falsity of such assertions as the following: people 
are honest or else they are liars; they are bright or they are dumb; 
they are kind or they are vicious, etc. 


4. HAZARDS AND FALLACIES IN PROBABILITY 
ESTIMATES 


Probability estimates, as worked out in terms of the standard error 
or probable error of statistical measures, are based on the assumption 
that obtained measures are not perfectly true measures because the 
error-factors operating are chance, rather than constant error
determinants of the results. Such estimates thus indicate the limits within
which the measures will most probably lie in so far as
chance error determinants are concerned. However, these estimates 
do not in themselves give any information whatsoever about the 
kind or quality of samples employed. They do not allow for or take 
cognizance of the effects of constant sources of error in measure- 
ment. Failure to recognize this fact has repeatedly led to fallacious 
inferences about the reliability or generality of a measure. 


We shall illustrate the hazards of interpreting probability esti- 
mates, first, with respect to one individual’s scores, and then with 
respect to those of a group. 


a. The Probability of Error of One Individual’s Scores 


A person is given 36 successive trials of 60 seconds each, after 
many practice trials, on the three-hole coordination board under 
standardized conditions. If his average number of insertions per 
minute is 85, and the standard deviation of the 36 trials is 3.0, then 
the standard error of his mean score is equal to 0.5. Interpreted 
literally, the chances are about 68 in 100 that his true coordination 
test score lies within the limits of 85 ± 0.5, i.e., between 84.5 and
85.5. And the chances are practically 100 in 100 that his true score
lies within the limits of 85 ± 3(0.5), which are 83.5 and 86.5.
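
The arithmetic of this example can be checked with a short sketch. The function name standard_error_of_mean is illustrative only and does not come from the text; the multipliers 1 and 3 correspond to the "about 68 in 100" and "practically 100 in 100" limits, on the assumption that chance errors are normally distributed.

```python
import math

def standard_error_of_mean(sd, n):
    """Standard deviation of the trials divided by the square root of their number."""
    return sd / math.sqrt(n)

# Figures taken from the example in the text.
mean_score, sd, n_trials = 85.0, 3.0, 36
se = standard_error_of_mean(sd, n_trials)

limits_1se = (mean_score - se, mean_score + se)          # "about 68 in 100"
limits_3se = (mean_score - 3 * se, mean_score + 3 * se)  # "practically 100 in 100"

print(se)          # 0.5
print(limits_1se)  # (84.5, 85.5)
print(limits_3se)  # (83.5, 86.5)
```

The computation says nothing about whether the 36 trials were a fair sample; it only converts the observed variability into limits.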

Such a literal interpretation has validity for a generalization about 
the subject’s ability on the test only if it is true that these 36 trials 
are truly a random sample of his capacity at this task. A truly 
random sample would be one in which each trial deviates from the 
subject’s hypothetically true capacity only for reasons of chance 
error determinants. In so far as constant error determinants affect 
the rate of his insertions, the broad generalization cannot be validly 
made. Some constant error factors that might affect such a task, 
and which possibly may operate in a subject’s performance on any 
type of task, are the following: (1) his general organic condition, 
physical and psychological, at the time of the test (this is often the 
most important source of error in individual interpretation, since it 
is more apt to be overlooked than other kinds of error); (2) practice 
or fatigue effects during the test (this is a specific variation in organic 
condition during the course of the testing, and the operation of 
such changes can oftentimes be detected by analyzing the trend of an 
individual's scores); (3) any continually repeated errors of procedure,
such as in the timing or recording of his trials (such factors would
of course need to operate in the same direction in order to be constant
rather than chance error determinants).













After checking, so far as possible, these various sources of constant 
error and allowing for them, we can avoid the fallacy of too broad 
a generalization by limiting it to the subject as he was capable of 
functioning at the time of the test. And, in practice, this limitation 
is often imposed, thereby adding “generalizations” to the psycho- 
logical literature of relatively isolated propositions. 


Our Generalizations Are Not Necessary Inferences 


The important question arises: could we ever obtain sufficient 
evidence to demonstrate the validity of the broad generalization,
viz., this subject’s true coordination test score lies within the limits 
of his obtained mean score, plus and minus a small range of error? 
The answer is no, and it is important that the reasons for the truth- 
value of this negative reply be apparent, since it is characteristic 
of all such questions in statistical inference.

First, it is mathematically true that the probability estimate itself 
allows for the negative answer. Even if we might assume that our 
sampled instances are random representatives of the subject's capacity 
on the test, affected only by errors which are as apt to operate favor- 
ably or unfavorably on his mean score, the probability estimate never 
reaches the status of certainty. In other words, the probabilities 
do not become 100 in 100. The probabilities only approach cer- 
tainty; they never reach it. Interestingly enough, there is a super- 
ficial exception to this statement, conceivably arising because of the 
nature of the standard error formula. If we recall that the standard 
error of a mean score ($\sigma_M$) is the ratio of the standard deviation
of the distribution of obtained measures to the square root of the
number of measures, i.e., $\sigma_M = \sigma/\sqrt{N}$, we see that were all measures of the same value,
there would be no deviation, and hence $\sigma_M$ would be zero. Such a
result might occur if a sample consisted of only two or three trials, 
or if the units of measurement were too gross (as might be the case 
in our example if the number of insertions per minute were counted 
only in units of the nearest number divisible by 10). Such excep- 
tions do not invalidate the soundness of the standard error formula, 
since its usage is based on the assumption of at least 25 trials or 
instances and on the further assumption that the units of measure- 
ment will be sufficiently fine to reveal a variability attributable to the 
operation of chance determinants. 
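
The superficial exception can be illustrated with a minimal sketch. The six scores below are hypothetical, and the use of the population standard deviation (statistics.pstdev) is an assumption, since the text does not say which form of the standard deviation is intended.

```python
import math
import statistics

def standard_error_of_mean(scores):
    """Population SD of the scores divided by the square root of their number."""
    return statistics.pstdev(scores) / math.sqrt(len(scores))

# Hypothetical insertion counts for six one-minute trials.
fine_scores = [86, 88, 91, 89, 87, 92]

# The same trials recorded only to the nearest multiple of 10.
# (x + 5) // 10 * 10 rounds halves upward with integer arithmetic.
coarse_scores = [(x + 5) // 10 * 10 for x in fine_scores]

se_fine = standard_error_of_mean(fine_scores)      # greater than zero
se_coarse = standard_error_of_mean(coarse_scores)  # exactly zero

print(coarse_scores, se_fine, se_coarse)
```

A standard error of zero here reflects the grossness of the unit, not a subject without variability.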

In the second place, the answer to our question is no because of 
the very nature of statistical inference. Statistical inference, which 
might be defined as reasoning from samples, is a form of probable
inference, in contrast to necessary inference. Our inference about
the subject’s coordination test ability would be necessary, that is, 
strictly deductive, if we could assert premises a and b of the follow- 
ing: 

a. What is true of this subject in the 36 trials of the test situation 
is true of his capacity whenever tested. 


b. A mean score of 85 with a range in values from 83.5 to 86.5 
is true of this subject in the 36 trials of the test situation. 


Therefore, a mean score of 85 with a range in values from 83.5 
to 86.5 is a true statement of his capacity whenever tested. Obvi- 
ously the first premise is not necessarily true and, consequently, the 
conclusion does not necessarily follow. Nor could the truth-value 
of the first premise ever be established, as stated, since we cannot 
take into account unsampled, future possibilities. 


Our Generalizations as Statements of Probability 


Our generalization about the subject’s capacity thus has to take 
a different form. It has to be a statement of probability. Thus: 
the subject's true coordination test score most probably lies within the 
limits of his obtained mean score, plus and minus a small range of 
error; or, the chances are practically 100 in 100 that the subject's 
true coordination test score lies within the limits of 85 ± 3(0.5).


But as we have emphasized, the truth value of this generalization 
is dependent upon the fairness of the sample of measures obtained. 
Are they indeed a random sample of his capacity whenever tested? 
If they are, the generalization is sound. The problem therefore is 
to determine whether the sample is probably a fair one, and the pro- 
cedure for this is to take repeated samples under given bio-situational 
conditions. The sampling will be demonstrated to be more probably 
fair if the repeated samples continue to give mean measures which 
tend to cluster about the original mean value of 85, and which tend 
to distribute about that mean in the manner of the normal probability 
distribution, with a standard deviation equal to approximately 0.5 
(the $\sigma_M$ of the original mean). In any event, it should be apparent
that a literal interpretation of the $\sigma_M$ in relation to the mean of
the sample of 36 trials is hazardous. Whether, in fact, the subject 
tends to be as relatively invariable in this capacity as the mathemat- 
ical results indicate; whether, in fact, determinate factors, constantly 
operating in a negative or a positive direction during the 36 trials, 
may be entirely absent in another sample of trials: these are questions
which cannot be given any satisfactory answer unless we have
additional evidence. By the method of repeated sampling, the prob- 
able truth value of the original generalization can be checked. By 
this method we can arrive at an answer which may be satisfactory 
for our purposes, even though it will never be a statement of neces- 
sity. 
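
The method of repeated sampling can be sketched by simulation. Everything below is an assumption made for illustration: a hypothetical "true" capacity of 85, normally distributed chance errors with a standard deviation of 3.0, and a constant error of -2 insertions operating throughout a second series of samples.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is repeatable

TRUE_MEAN, TRIAL_SD, N_TRIALS = 85.0, 3.0, 36  # assumed values, not data from the text

def sample_mean(constant_error=0.0):
    """Mean of one sample of 36 trials; constant_error shifts every trial alike."""
    trials = [random.gauss(TRUE_MEAN + constant_error, TRIAL_SD) for _ in range(N_TRIALS)]
    return statistics.mean(trials)

# 2000 repeated samples affected by chance errors only.
chance_means = [sample_mean() for _ in range(2000)]
sd_of_means = statistics.pstdev(chance_means)  # expected near 3.0 / sqrt(36) = 0.5

# 2000 repeated samples taken while a constant error is operating.
biased_means = [sample_mean(constant_error=-2.0) for _ in range(2000)]
shift = statistics.mean(biased_means) - statistics.mean(chance_means)  # expected near -2

print(round(sd_of_means, 2), round(shift, 2))
```

Each biased sample, taken alone, would still yield a standard error of about 0.5; only the comparison between samples reveals the constant error.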


The Concept of a True Measure 


We have pointed out that a probability estimate is supposed to 
give the limits of error within which a subject’s true mean score 
will lie, provided the factors which make for the variability of his 
individual scores operate according to principles of chance deter- 
mination. But we know that individuals change as they grow,—as 
they grow older. A subject is variable not only during his test 
performance, for whatever reasons, but the speed and precision of his 
movements will definitely change as he matures and then as his ar- 
teries harden—as his metabolic processes change. We have no reason 
to doubt that this is one of the invariant rules of human nature. It 
is one of the qualities of all human beings. In view of this, what 
meaning can be given to the concept of a true measure? 

The following definition may be taken as typical of those custom- 
arily given: “By the ‘true’ measure of an individual’s capacity in 
any trait, as for example, the true measure of his height, reaction 
time, or intelligence, we mean the average of an infinite number of 
measurements of the given capacity made under precisely the same 
conditions.”⁵ Following this definition, the important qualification
is made that we can never deal with true measures, as thus defined. 
But, it is contended that we can measure the amount by which an 
obtained measure most probably varies from its corresponding true 
measure. The probability estimate, $\sigma_M$, is of course the measure of
the amount of probable divergence of an obtained mean score from 
the true mean value. 

A true measure, as defined, is thus a purely hypothetical concept. 
We obviously cannot obtain an infinite number of measurements 
of anything. And it appears highly improbable that any two meas- 
urements can be made under “precisely the same conditions”, since 
conditions refer to a variable organism in a changing world. Even 
if situational, i.e., world, conditions could be held constant, the living 
organism would continue to change (it costs energy to live). 





5 Garrett, H. E. Statistics in Psychology and Education. 1926. Longmans, 
Green. New York. p. 118. 













The fact that this concept of a true measure is purely hypothetical 
does not mean, however, that it is valueless. From the point of 
view of the hazards and fallacies of measurement it does mean 
that we should not forget the fact of its hypothetical nature when 
interpreting statistical results in the light of probability estimates. 
When we forget, we then tend to speak of obtained measures as if 
they were practically identical (within the calculated limits of error) 
with true measures. Such thinking tends to lead to the view that our 
measures of individuals’ capacities give us practically true statements 
of their degree and nature, as of the past, present, and future,—at 
least barring accident, disease, and old age. This in turn is apt to 
lead to false predictions concerning their future behavior, or incline 
us to erroneous statements about their past behavior. A common 
example of this sort of fallacious reasoning often occurs in the inter- 
pretation of children’s I. Q. scores. They are sometimes considered 
to be practically true indices of a relatively unchanging degree of 
general ability, and predictions, all too specific, of individuals’ future 
behavior are made upon the basis of them. 


The Value of the True Measure Concept 


Although the concept of a true measure is valuable, we have 
criticized it in order to emphasize the hazards surrounding its use. It 
is now appropriate to discuss its significance in psychological meas- 
urement. Its value lies in the fact that individuals at the same time 
are changing organisms and remain relatively unchanged—that it 
is a changing world but a world of recurring features, many of which 
appear to remain unchanged. We are variable and our world is 
variable, yet at the same time we maintain an identity in a relatively 
familiar and constant environment. 

The concept of a true measure arises, then, out of the fact of 
the relative constancy of various aspects of our behavior-in-relation- 
to-environment. The concept points to habitual ways of acting—"to 
recurrent modes of interaction taking place between what we term 
organism, on the one side, and environment, on the other.”⁶ Its
value in measurement is that it implies the possibility of an estimate 
of the degree to which a person tends to remain relatively unchanged 
in a series of behavior-environmental transactions. And in some 
psychological functions, tested under standardized conditions, the 
individual’s level of ability does tend to remain relatively constant 


6 Dewey, J. “Conduct and Experience” in Psychologies of 1930, p. 411. 













over a period of years. Particularly is this apt to be characteristic 
of some psychological functions, when one’s level of ability is considered
in relation to the group, rather than in relation to his own
past performance. The individual's actual level of achievement may 
change—it does change as he grows up—but it may at the same time 
remain relatively constant in relation to changes generally character- 
istic of his age, sex, and culture group. His true measure, in such a 
case, is estimated on the basis of his deviate position in his group, for 
which standard measures (z scores) and their derivatives are usually 
employed as the means of noting the possibility of a relative con- 
stancy, a constancy that may be typical while the individual himself 
is changing. 
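
A minimal sketch of this relative constancy follows. The scores are hypothetical, and the population standard deviation is assumed in computing the standard measure.

```python
import statistics

def z_score(score, group_scores):
    """Deviate position of a score in its group, in standard-deviation units."""
    return (score - statistics.mean(group_scores)) / statistics.pstdev(group_scores)

# Hypothetical raw scores of the same five children at age 8 and again at age 10.
age_8 = [20, 24, 28, 32, 36]
age_10 = [30, 36, 42, 48, 54]

# The child listed fourth: his raw score rises from 32 to 48.
z_at_8 = z_score(age_8[3], age_8)
z_at_10 = z_score(age_10[3], age_10)

print(round(z_at_8, 3), round(z_at_10, 3))  # the two z scores agree
```

The child's actual level changes while his deviate position in the group does not.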

Whether, in fact, an individual’s level of achievement in any 
psychological function does tend to remain relatively constant, either 
in relation to his past behavior or in relation to his group, can be 
determined only by repeated samplings at different times. Such rela- 
tive constancy cannot be validly inferred from probability estimates 
as derived from a series of performances at a given time. Without 
repeated samplings, we need to impose upon our generalizations the 
limitation already mentioned, viz., a probability estimate about a 
subject’s true measure is a statement of his ability at the time of the 
test.⁷


b. The Probability of Error of the Scores of a Group 


Any probability estimates about statistical measures based upon 
the scores of groups of individuals have hazards in their literal inter- 
pretation similar to those for one individual's scores. This is perhaps 
obvious, since the group’s scores are based upon the scores of indi- 
viduals. 

Two of the most common sources of fallacy in the interpretation 
of groups’ results arise in the too literal interpretation of the standard 
error of the difference between two mean measures, and in the 
standard error of a correlation coefficient. 

If a difference between two mean measures is greater than 3 times 
the standard error of the mean difference, the literal interpretation 
of this numerical ratio is that the chances are practically 100 in 100 
that the true mean difference is greater than zero. Again, we are 
dealing with the concept of a true measure, and what has just been 


7 The odd-even or split-half method of determining the reliability of a 
test is of course subject to this temporal limitation—a fact often overlooked 
in the interpretation of coefficients obtained by this method.










said about this concept for an individual is also applicable here. The 
proposition is valid provided the obtained value of the mean differ- 
ence is affected by only chance error determinants. And, as we saw, 
whether in fact this may be the case cannot be determined from only 
one sampling. 
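
The ratio in question can be sketched as follows. The text does not give the formula for the standard error of a difference; the version below assumes two independent groups, and the group figures are hypothetical.

```python
import math

def se_of_mean(sd, n):
    """Standard error of a mean: SD divided by the square root of N."""
    return sd / math.sqrt(n)

def critical_ratio(mean_a, sd_a, n_a, mean_b, sd_b, n_b):
    """Difference between two means divided by the standard error of the difference.

    Assumes independent groups, so that the standard error of the difference
    is sqrt(SE_a**2 + SE_b**2). This formula is an assumption of the sketch.
    """
    se_diff = math.sqrt(se_of_mean(sd_a, n_a) ** 2 + se_of_mean(sd_b, n_b) ** 2)
    return (mean_a - mean_b) / se_diff

# Hypothetical results for two groups.
ratio = critical_ratio(mean_a=85.0, sd_a=3.0, n_a=36, mean_b=82.0, sd_b=4.0, n_b=64)
print(round(ratio, 2))
```

A ratio above 3 satisfies the numerical criterion, but it cannot show that the two samples were free of constant error.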

As for the standard error of a correlation coefficient, it will suffice 
here, perhaps, to refer to a specific fallacy sometimes characteristic 
of its too literal interpretation. When the error occurs, it is an- 
other instance of the fallacy of selection. It is a case of making 
generalizations from too small a sample of instances. 

The standard error of a product-moment coefficient ($\sigma_r$) is equal
to $(1 - r^2)/\sqrt{N}$. It is evident that as r approaches 1.0 or -1.0, the
value of $\sigma_r$ approaches zero. If r happens to be exactly 1.0, its
$\sigma_r$, by the formula, is of course equal to zero. Rarely are actual
coefficients equal to 1.0, regardless of the kind of sampling, so long
as it is not deliberately selected and so long as it consists of a fair-sized
group of instances. However, correlations of .90 or .95 are
sometimes obtained from small samples. A coefficient of .95 based
on 25 cases, for example, has a $\sigma_r$ equal to about .02. If this probability
estimate is interpreted literally, the chances are practically 100
in 100 that the true value of r will lie within the limits of .95 ± 3
(.02), which are .89 and 1.00. But, as we have emphasized, such
a literal interpretation for so small a sample, and only one sample at 
that, is too hazardous to take seriously. 
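
The figures in this example can be verified with a short sketch, assuming the formula $(1 - r^2)/\sqrt{N}$ as given above. The function name se_of_r is illustrative, and the upper limit is capped at 1.00 because a coefficient cannot exceed that value.

```python
import math

def se_of_r(r, n):
    """Standard error of a product-moment coefficient: (1 - r**2) / sqrt(N)."""
    return (1 - r ** 2) / math.sqrt(n)

r, n = 0.95, 25
se = se_of_r(r, n)
lower = r - 3 * se
upper = min(r + 3 * se, 1.0)  # r cannot exceed 1.00

print(round(se, 4), round(lower, 2), round(upper, 2))
```

Note that the formula gives a smaller standard error the higher the obtained r, whatever the quality of the sample.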


B. SOME SPECIFIC STATISTICAL FALLACIES 


1. THE FALLACY OF OVER-REFINED DATA 


This fallacy occurs when one acts upon the belief that painstaking 
and laborious calculations will somehow make up for the inadequacy 
of carelessly obtained data. Statistical manipulation does not take
the place of good data.


Original measures are sometimes summarized by constants of over- 
refined values. Thus, if raw scores are in integer form, the compu- 
tation of average and deviational measures to 3 or 5 decimal places 
is not only wasted effort but also hazardous in that such results are 
apt to imply that the original data were similarly refined. For scores 
that are integers, one decimal place is usually adequate for any 
summary or comparative purposes. Correlation coefficients are adequately
expressed by two decimal places, although for some analytical
purposes, as for example, tetrad analyses, they are often expressed 
to four places. 


2. FALLACIES ARISING FROM SPURIOUS CORRELATION 


Spurious correlation occurs if some degree of co-variability is 
present between two variables which have no relevant basis for being 
related, or if a correlation is in part dependent upon some factor 
illegitimately common to two variables—variables which are other- 
wise truly related. 

Obvious examples of the former do not often occur. Their possible 
occurrence arises out of the fact that the product-moment method of
correlation can be mathematically applied to any series of paired 
scores. As a source of error, this hazard is not great in psychological 
measurement, since most correlations are of series of score-pairs ob- 
tained from the same individuals, or of pairs of individuals of blood 
relation, such as parent and offspring. 

The second source of spurious correlation often goes unrecognized 
and is illustrated in a number of ways. Thus, if we are interested 
in discovering whether there is any relation between average length 
of fingers and mechanical aptitude, an illegitimate source of correla- 
tion would be introduced were we to use subjects not fairly homo- 
geneous in age. Particularly would this be true if our subjects were 
children, ranging in age, say from 5 to 10 years. Spurious correla- 
tion would be introduced here because the younger children, even 
though they might be mechanically advanced for their age, would 
probably do poorly as compared to the older children, and of course 
their fingers would also be shorter. Similarly, the older children would 
tend to be in the upper range of both variables, regardless of their 
positions in their own age group. 
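
The effect of age heterogeneity can be sketched by simulation. All figures are hypothetical: finger length and mechanical aptitude are each made to depend on age and on independent chance factors, so that within any one age group the two are unrelated.

```python
import random

random.seed(1)  # fixed seed so the sketch is repeatable

def pearson_r(xs, ys):
    """Product-moment correlation coefficient of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

# 1000 hypothetical children at each age from 5 to 10 years.
children = []
for age in range(5, 11):
    for _ in range(1000):
        finger_cm = 4.0 + 0.4 * age + random.gauss(0, 0.3)  # grows with age
        aptitude = 10.0 * age + random.gauss(0, 10.0)       # improves with age
        children.append((age, finger_cm, aptitude))

pooled_r = pearson_r([c[1] for c in children], [c[2] for c in children])
within_r = {
    age: pearson_r([c[1] for c in children if c[0] == age],
                   [c[2] for c in children if c[0] == age])
    for age in range(5, 11)
}

print(round(pooled_r, 2), {a: round(r, 2) for a, r in within_r.items()})
```

The pooled coefficient is substantial, while the coefficients within each age group stay near zero.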

This source of spurious correlation emphasizes again the importance 
of having subjects relatively homogeneous in age. However, the 
effects of such heterogeneity raise a real problem in psychological 
measurement. It is the problem of determining whether results 
with one age group are a fair sample of other age groups. Unless the 
determination is made, so far as it is possible to do so, we have seen 
that we are apt to make the genetic fallacy or the fallacy of enelico- 
morphism when generalizing about other age groups or about human 
nature in general. 










Spurious Index Correlation⁸


Yule has shown that of three uncorrelated variables, correlations up 
to .50 may be obtained if the individuals’ scores for each of two varia- 
bles are taken in ratio to their scores of the third variable. Thus, 
if the three variables are denoted as x, y, and z; and $r_{xy}$, $r_{xz}$, and
$r_{yz}$ are all equal to zero, then $r_{v_1 v_2}$ may take a value as high as .50,
when $v_1$ is an index taken as the ratio x/z, and $v_2$ is a second index
taken as the ratio y/z. The individuals’ scores of variable z are seen
to be common to both of the new variables, $v_1$ and $v_2$; hence the
basis for spurious correlation. 
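
Spurious index correlation can be sketched by simulation. The three variables below are hypothetical and independent, each with a mean of 100 and a standard deviation of 10; equal relative variability of this kind is assumed here as the condition under which the index correlation comes close to .50.

```python
import random

random.seed(2)  # fixed seed so the sketch is repeatable

def pearson_r(xs, ys):
    """Product-moment correlation coefficient of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

N = 20000
x = [random.gauss(100, 10) for _ in range(N)]
y = [random.gauss(100, 10) for _ in range(N)]
z = [random.gauss(100, 10) for _ in range(N)]

r_xy = pearson_r(x, y)            # near zero: x and y are independent

v1 = [a / c for a, c in zip(x, z)]  # index x/z
v2 = [b / c for b, c in zip(y, z)]  # index y/z
r_v1v2 = pearson_r(v1, v2)        # near .50, owing only to the common divisor z

print(round(r_xy, 3), round(r_v1v2, 3))
```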


The most frequent example of spurious index correlation arises 
perhaps in correlating I. Q. scores, derived from two different tests, 
of children heterogeneous in age. Since the I. Q. is an index 
taken as the ratio of mental age score to chronological age, the chil- 
dren's chronological age scores are common to the two variables 
correlated. Such spuriousness is avoided if a homogeneous age group 
is used, since both variables would then be indices derived as ratios 
to a stable, rather than variable denominator. Or, for a heterogen- 
eous age group, the technique of partial correlation is often employed 
to eliminate or reduce the effect of variable age scores.⁹


3. THE POSSIBLE FALLACY OF INFERRING CAUSALITY
FROM CORRELATION COEFFICIENTS 


Strictly speaking, an obtained coefficient of correlation of itself 
tells nothing of the reasons for such co-variability as may be ob- 
tained in the relationship of two variables. In general, a measure 
of correlation possibly expresses one of three types of relations: 


8 Cf. Garrett, H.E., ibid., p. 260, and Yule, G.U., An Introduction to the 
Theory of Statistics. Griffin, London, 5th edition, 1919, pp. 215-16.


9 The use and interpretation of the technique of partial correlation are 
hazardous in psychological measurement. The hazards arise out of the fact 
that its legitimate use is based on the assumption that the variable held con- 
stant is strictly unitary. Strictly unitary factors are possibly a rarity in 
psychological measurement since different activities of individuals are probably 
dependent upon an interrelated functioning of the whole organism. In spite 
of this, we find the technique frequently used to partial out, or “hold constant”,
the “age factor” in problems where subjects are heterogeneous in this
respect. It may be apparent that this is a very dubious practice since age 
y~ de cannot by any stretch of the imagination be signs of developmental 
or psychological differences which are strictly unitary in character. 













(1) Purely fortuitous, in which case the variables correlated are 
known to be independent of each other. It is mathematically possible
to obtain a degree of correlation between two sets of measures,
the variability of each set being known to depend on determining 
factors that have nothing in common with each other. Thus, in 
tossing a pair of dice there may be some correlation between the 
face values of the pairs if we make only a hundred or so tosses, 
although the action of each die is independent of the action of the 
other. Or a degree of correlation might be obtained between the 
birth rate in California and the rain-fall in New Zealand over a 
period of years. Such a co-relation we would certainly suspect of 
being purely fortuitous. 


(2) Absolutely determined, in which case changes in one vari- 
able are known to be fully determined by changes in the other 
variable, or changes in both variables are known to be fully deter- 
mined by a set of conditions common to both. The existence of 
such relations between two sets of psychological measures is prac- 
tically impossible of proof. Where causal relations exist they are 
most probably of the next type. An example of a fully determined 
relation is afforded by the correlation between the size of the radius 
and circumference of circles varying in area. 


(3) Partially determined and partially fortuitous: If a given re- 
lation between two sets of variable measures can be demonstrated 
to be not wholly fortuitous, then to some extent, usually unknown, 
the determining conditions of one variable may possibly be in part 
similar to the determining conditions of the other variable, or 
changes in one variable will, in part, be directly determined by 
changes in the other. Thus a correlation of .68 between the weight 
and height of a fair sample of ten-year-old boys might be interpreted 
as mainly a function of similar determining factors, rather than as 
largely fortuitous, because we know that weight and height are 
both dependent on physiological conditions of the individual organ- 
isms whose pairs of measures make up the two variables. Further- 
more, from the fact that the correlation between weight and height 
is far from perfect, we may infer that all of the factors that make 
for variability in weight are not identical with all of the factors that 
make for variability in height. The point, however, which we wish 
to make is that no inference of causal relation can validly be made 
from only the statistical statement of correlation. We need to go 
beyond the measure and make an analysis of the kinds of phenomena 
with which we are dealing, as well as of the conditions under which 








386 JOHN GRAY PEATMAN 


the measure is obtained. Knowledge from other than statistical 
sources may then warrant the conclusion that, in part at least, varia- 
tions in one series of measures are causally related to variations in 
the other. 
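
The first two types of relation can be sketched numerically. The dice are simulated with a fixed seed, and the radii are arbitrary illustrative values; neither comes from the text.

```python
import math
import random

random.seed(3)  # fixed seed so the sketch is repeatable

def pearson_r(xs, ys):
    """Product-moment correlation coefficient of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

def dice_correlation(n_tosses):
    """Correlation between the faces of two independent dice over n_tosses."""
    first = [random.randint(1, 6) for _ in range(n_tosses)]
    second = [random.randint(1, 6) for _ in range(n_tosses)]
    return pearson_r(first, second)

# (1) Purely fortuitous: twenty sets of 100 tosses, then one very long series.
r_small = [dice_correlation(100) for _ in range(20)]
r_large = dice_correlation(100_000)

# (2) Absolutely determined: radius and circumference of circles.
radii = [1.0, 2.0, 3.5, 5.0, 8.0]
circumferences = [2 * math.pi * r for r in radii]
r_circle = pearson_r(radii, circumferences)

print([round(r, 2) for r in r_small], round(r_large, 3), round(r_circle, 3))
```

Short series of tosses give coefficients scattered on either side of zero; the long series gives one close to zero, and the circles give a coefficient of 1.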

When correlations are obtained between measures involving the 
psychological activities of individuals, inferences implying causality
are hazardous because our knowledge of the determinants involved
is often inadequate. Thus, if we find that there is a correlation of
.50 between the intelligence scores of 100 pairs of siblings, and a 
correlation of .80 between the intelligence scores of 100 pairs of 
“identical” twins, we may be tempted to infer (as a number of 
psychologists have) that the greater degree of co-variability for the 
scores of the twin pairs is attributable to the operation of deter- 
minants common to their genetic constitutions. Although current 
genetic theory supports the view that identical twins have more 
similar genetic constitutions than do siblings, we do not know to 
what extent variability in measures of intelligence is attributable 
to the operation of such conditions. The possibility is not ruled 
out, for example, that the greater degree of co-variability in the 
intelligence scores of twins, as compared to siblings, is in important 
part at least, attributable to a greater homogeneity in the social 
conditions of their development. Recent research into this old prob- 
lem is afforded by Newman, Freeman, and Holzinger’s studies of 
19 pairs of identical twins reared apart.¹⁰ Whereas the Binet
mental age scores of 50 pairs of identical twins reared together cor- 
related .92, the Binet scores of the 19 pairs of identical twins reared 
apart gave a correlation of .64. 


4. THE POST HOC ERGO PROPTER HOC FALLACY IN
PSYCHOLOGICAL STATISTICS 


Interpreted literally, this Latin phrase of course means, “after this, 
therefore because of this.” The expression names the fallacy of 
assuming that whatever follows an event is therefore caused by it. 
Consider, for example, correlations between intelligence test scores 
of parents and offspring. The inference is often made that the 
basis for all such correlation as is found (except for the operation 


10 Newman, H.H., Freeman, F.N., and Holzinger, K.J. Twins: A Study 
of Heredity and Environment. 1937. Univ. of Chicago Press, Chicago, 369 
pp. (Cf. Table 96, p. 347). 













of error factors) must be the effect of parents’ intelligence on that 
of their offspring via the germ plasm. Although part of such cor- 
relation may be attributable to the biological and cultural effects 
of the parents on their children, nevertheless, part is conceivably 
attributable to the offspring, because of what they bring into the 
home from the outside world, particularly from their schools. 

This fallacy of post hoc ergo propter hoc is prone to occur in the 
interpretation of the results of psychological experimentation. It 
is now sufficiently recognized as a possibility in drug experiments 
that careful investigators use equated groups and other methods of 
control. The fact that some individuals report that they never
sleep well if they drink coffee for dinner has psychological signifi- 
cance. But their insomnia is not necessarily attributable to the 
chemical properties of the coffee, although it may very well have 
something to do with the psychological act of coffee-drinking. The 
fact that other individuals report that the drinking of coffee helps 
them to sleep soundly certainly suggests that either individuals dif- 
fer radically in their susceptibility to the chemical properties of 
coffee, or they have acquired different types of delayed response to 
the act of coffee-drinking. Whether, however, any determinate 
relationships exist between these temporally related events is a prob- 
lem of careful experimental inquiry. 


5. FALLACIES ARISING FROM DIFFERENT OR CHANGED 
STANDARDS OF COMPARISON 


This fallacy may occur in many ways. It commonly arises, for 
example, in the comparison of correlation coefficients derived in 
the measurement of similar psychological functions for two or more 
groups of individuals, which differ considerably in their respective 
ranges of ability. Thus, in attempting to determine the relationship 
that may exist between scores on an intelligence test and academic 
grades of college freshmen, one psychologist reports correlations 
around .50, and another one reports correlations around .30. 
The comparison of coefficients is possibly unfair if the freshmen 
of the former groups are admitted to their college on the basis of 
a high school diploma, whereas those of the latter are admitted only 
if they have had a high school average of 80 or better, and pass 
standardized entrance examinations. The latter would probably be 
more highly selected groups, both in the sense of their level of ability
and in its narrowed range. This illustration is a special case of
a difference in sampling. The point here is that correlation coefficients
decrease in size as the range of sampled abilities is decreased,
provided, of course, other factors remain constant. Consequently, 
in comparing results of different investigators, it is well to ascertain 
whether their samples are drawn from similar ranges of ability. 
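
The effect of a narrowed range can be sketched by simulation. The figures are hypothetical: test scores and grades are generated in standard units with a correlation of .50 in the unselected group, and selection is simplified to admitting only those above the average test score, which is an assumption of the sketch and not the admission rule described in the text.

```python
import random

random.seed(4)  # fixed seed so the sketch is repeatable

def pearson_r(xs, ys):
    """Product-moment correlation coefficient of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

RHO = 0.5  # assumed correlation in the unselected group
pairs = []
for _ in range(20000):
    test = random.gauss(0, 1)
    grades = RHO * test + (1 - RHO ** 2) ** 0.5 * random.gauss(0, 1)
    pairs.append((test, grades))

r_full = pearson_r([p[0] for p in pairs], [p[1] for p in pairs])

# Selected group: only those above the average test score.
selected = [p for p in pairs if p[0] > 0]
r_restricted = pearson_r([p[0] for p in selected], [p[1] for p in selected])

print(round(r_full, 2), round(r_restricted, 2))
```

The same underlying relation yields a noticeably lower coefficient in the selected group.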

It has been said that there is a greater incidence of behavior prob- 
lems among school children today than was the case 15 years ago. 
Or, we hear that more people are neurotic today than ever before. 
Both statements are possibly true. On the other hand, such gen- 
eralizations will be fallacious if standards for ascertaining the inci- 
dence of such things and for comparing them have changed. Today 
we are more aware of and interested in children’s behavior prob- 
lems and of neurotic conditions in adults. Furthermore, if the com- 
parisons are made in absolute numbers instead of in percentages, an 
immediate source of error is present, since a greater number of chil- 
dren are attending schools today than 15 years ago, and since the 
population of the country has been steadily increasing. 


6. HAZARDS ARISING FROM THE MISUSE OF 
STATISTICAL TECHNIQUES 


Finally, a frequent source of error in psychological measurement 
arises from the utilization of misleading or inadequate statistical 
techniques. A large corporation recently announced that its
employees during the past year received an average salary of about
$1,500.00, whereas an employee group countered that their salaries
averaged less than $1,000.00. This discrepancy possibly arose from
the use of the arithmetic mean for the first figure (in which case
the salaries of a few highly paid executives may have pulled the
average upward) and the median for the second figure. And, of
course, their respective samplings of employees may have been dif-
ferent.
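
How two honest "averages" of one payroll can differ so widely may be seen in the following Python sketch. The payroll is wholly hypothetical; its figures were chosen only so that the mean and the median fall on opposite sides of the two amounts quoted above.

```python
import statistics

# Hypothetical payroll: (annual salary in dollars, number of employees).
payroll = [(900, 60), (1200, 30), (3000, 8), (18000, 2)]

# Expand the table into one salary per employee.
salaries = [salary for salary, count in payroll for _ in range(count)]

mean_salary = statistics.mean(salaries)
median_salary = statistics.median(salaries)

print(f"employees: {len(salaries)}")
print(f"mean salary:   ${mean_salary:,.2f}")
print(f"median salary: ${median_salary:,.2f}")
```

Here two executives out of one hundred employees are sufficient to raise the mean far above the salary received by the typical employee, which the median reflects.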

If distributions of scores tend to fulfill the conditions of the nor- 
mal probability distribution, it makes little difference, so far as 
measures of central tendency are concerned, whether we use the 
median or the mean. But in highly skewed distributions our choice 
will make a difference. We should then use the median if we are 
interested in obtaining the most typical measure. On the other 
hand, if we are interested in taking into account the influence of
the extreme scores in the skewed tail, we should use the mean. In either
event, we should point out the fact that the distribution is skewed
and represent it graphically, if convenient. 
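
One simple way of reporting skewness, together with a rough graphic representation, is sketched below in Python. The choice of Pearson's second coefficient of skewness, 3(mean - median)/SD, is ours and is only one of several possible indices; the standard deviation is here computed with N in the denominator, and both sets of scores are invented.

```python
import statistics
from collections import Counter


def pearson_skewness(scores):
    """Pearson's second skewness coefficient: 3 * (mean - median) / SD."""
    sd = statistics.pstdev(scores)  # population SD (divides by N), an assumption
    return 3.0 * (statistics.mean(scores) - statistics.median(scores)) / sd


def text_histogram(scores):
    """Crude frequency plot: one row per distinct score, one star per case."""
    counts = Counter(scores)
    rows = [f"{value:>3} | {'*' * counts[value]}" for value in sorted(counts)]
    return "\n".join(rows)


# Invented scores: the second set differs only in its one extreme value.
symmetric = [1, 2, 2, 3, 3, 3, 4, 4, 5]
skewed = [1, 2, 2, 3, 3, 3, 4, 4, 20]

for label, scores in (("symmetric", symmetric), ("skewed", skewed)):
    print(label)
    print(text_histogram(scores))
    print(f"mean={statistics.mean(scores):.2f}  "
          f"median={statistics.median(scores):.2f}  "
          f"skewness={pearson_skewness(scores):.2f}")
```

For the symmetric set the mean and median coincide and the coefficient is zero; for the skewed set the single extreme score draws the mean above the median and the coefficient becomes positive.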

Generally, the nature of obtained distributions should be an im- 
portant determining factor in one’s choice of techniques. Many 
measures, especially those based on the standard deviation from the 
mean, are most suitable for distributions which tend to be normal, 
and inappropriate for highly skewed distributions, or for distribu- 
tions of only a few measures, or for scores derived from a linear 
series of ranked order-of-merit positions. The product-moment cor-
relation coefficient is derived and used on the assumption of a rec-
tilinear relation between two variables; if the relation tends to be
curvilinear, then another type of coefficient (the correlation ratio,
eta) is more appropriate. A scatter diagram quickly reveals the nature of a co-
variable trend, if any. Fortunately, most of the co-variable rela- 
tions obtained in psychological measurement tend to be rectilinear 
rather than curvilinear, so that the possible source of error here is 
perhaps not great. 
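
The difference between the two coefficients may be made concrete with artificial data. In the Python sketch below, the function names and the data are hypothetical, and eta is computed by grouping the y scores on each distinct value of x; with actual test scores one would ordinarily group x into class intervals first.

```python
import math
from collections import defaultdict


def pearson_r(xs, ys):
    """Product-moment correlation of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def correlation_ratio(xs, ys):
    """Eta of y on x: sqrt(between-group sum of squares / total sum of squares)."""
    groups = defaultdict(list)
    for x, y in zip(xs, ys):
        groups[x].append(y)
    grand_mean = sum(ys) / len(ys)
    ss_total = sum((y - grand_mean) ** 2 for y in ys)
    ss_between = sum(
        len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups.values()
    )
    return math.sqrt(ss_between / ss_total)


# Artificial data: two observations at each x from -3 to 3.
xs = [x for x in range(-3, 4) for _ in range(2)]
offsets = [0.5, -0.5] * 7                      # small scatter about the trend
straight = [2 * x + d for x, d in zip(xs, offsets)]   # rectilinear trend
curved = [x * x + d for x, d in zip(xs, offsets)]     # U-shaped trend

for label, ys in (("rectilinear", straight), ("curvilinear", curved)):
    print(f"{label}: r={pearson_r(xs, ys):.3f}  eta={correlation_ratio(xs, ys):.3f}")
```

With the rectilinear data the two coefficients agree; with the U-shaped data the product-moment coefficient falls to zero while eta remains high, which is the circumstance in which reliance on r alone would mislead.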


SUMMARY 


In the foregoing pages we have given an account of some of the 
chief hazards and fallacies of statistical method in psychological 
measurement. We subdivided our presentation into two general 
parts: 

A. The Hazards of Generalizing, and 

B. Some Specific Statistical Fallacies. 


Under A we were concerned with the problem of evaluating 
propositions or generalizations about psychological phenomena in 
the light of such evidence as may be available—a problem of deter- 
mining to what extent the results of measurement are a fair sample 
of all the instances about which one wishes to make a generaliza- 
tion. We developed this problem from the point of view of fair 
population samples, pointed out methods of avoiding sampling 
errors, and described several types of related fallacies. The genetic 
fallacy, anthropomorphism and enelicomorphism were characterized 
as instances of the fallacy of selection. The group-to-individual 
fallacy as well as that of pigeon-holing were also described as ob- 
stacles to valid generalization in psychological measurement. Fin- 
ally, in this first section, we discussed hazards and fallacies of proba- 
bility estimates, describing the fact that statistical inference, as
reasoning from samples, is a form of probable inference, in contrast
to necessary inference—our generalizations, therefore, have to be 
statements of probability. In this connection we dealt with the 
concept of a true measure, pointing to its misleading character as 
well as to its value in psychological measurement. 


Under section B, we described several types of relatively specific
statistical fallacies: 1. the fallacy of over-refined data; 2. fal-
lacies arising from spurious correlation; 3. the possible fallacy of
inferring causality from correlation coefficients; 4. the post hoc ergo
propter hoc fallacy in psychological statistics; 5. fallacies arising
from different or changed standards of comparison; and 6. hazards
arising from the misuse of statistical techniques.


We do not imply that our account is an exhaustive statement 
of hazards and fallacies in the use of statistical method in psycho- 
logical measurement. However, we believe we have described and 
discussed most of the important ones that demand our awareness 
and vigilant attention if we are to avoid frivolous propositions and 
fruitless generalizations, as well as fallacious statements, when making 
inferences from the data of research.